
Introduce atomic slot migration #1591

Closed · wants to merge 18 commits

Conversation

murphyjacob4 (Contributor) commented:

Starting this PR in draft so we can review early and gather alignment.

Summary

The solution is mostly based on the discussion and proposals in #23:

  • Target driven: the user sends CLUSTER IMPORT to the target and it syncs down the slot
  • Uses AOF format for snapshot, proxying all commands directly from source primary through the target primary to all target replicas
  • Consensus-less epoch bump to broadcast the topology change

Import and Export Workflows

We utilize a new internal CLUSTER SYNCSLOTS command with additional sub-commands to transition through the slot migration state machine.

On either end of the migration, we track ongoing migrations via two job queues: slot_import_jobs and slot_export_jobs. Right now, we only support one concurrent job in each queue. There is no fundamental design restriction here (outside of perhaps some protocol additions to CLUSTER SYNCSLOTS), but the limit keeps the first iteration simple.

Workflow overview

  1. User sends CLUSTER IMPORT SLOTSRANGE <slot start> <slot end> ... to target node T
  2. T initiates a new connection to source node S
  3. If required, T runs AUTH based on replication configuration
  4. T initiates CLUSTER SYNCSLOTS START to S
  5. S begins tracking the client that sent the command as the slot export client. It spawns a child process at the next available time and runs AOF rewrite with just the specified slots. It then begins accumulating a backlog of writes in the slot export client output buffer, without installing the write handler.
  6. At the end of the AOF rewrite, S also appends CLUSTER SYNCSLOTS ENDSNAPSHOT
  7. T processes the AOF rewrite as it would any other client using readQueryFromClient. Once it gets the CLUSTER SYNCSLOTS ENDSNAPSHOT, T sends back a CLUSTER SYNCSLOTS PAUSE to pause S.
  8. Upon receiving the command, S unblocks the slot export client to T, which has been accumulating ongoing writes. S then pauses itself and sends CLUSTER SYNCSLOTS PAUSEOFFSET <offset> back to T with the current offset. (Note that this is not the primary replication offset; it is a computed offset based on how much data has been accumulated in the export client's output buffer for T.)
  9. T waits for its replication offset to catch up to the sent offset and, once caught up, executes the consensus-less epoch bump.
  10. S finds out about the bump via cluster gossip, unpauses itself, and cleans up dirty keys.
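The target-side flow above is essentially a small state machine driven by protocol events. The sketch below is a minimal model of it; the state and event names are hypothetical illustrations, not identifiers from the PR:

```c
#include <assert.h>

/* Hypothetical target-side import states, mirroring steps 2-9 above.
 * The PR's actual state machine may differ. */
typedef enum {
    IMPORT_CONNECTING,    /* steps 2-3: connect to S, optional AUTH */
    IMPORT_SNAPSHOTTING,  /* steps 4-7: SYNCSLOTS START, consume AOF snapshot */
    IMPORT_PAUSED,        /* step 8: sent PAUSE, awaiting PAUSEOFFSET */
    IMPORT_CATCHING_UP,   /* step 9: draining stream up to the pause offset */
    IMPORT_DONE,          /* step 9: epoch bumped, slots owned */
    IMPORT_FAILED         /* disconnect or timeout at any point */
} import_state;

typedef enum {
    EV_CONNECTED, EV_ENDSNAPSHOT, EV_PAUSEOFFSET, EV_CAUGHT_UP, EV_ERROR
} import_event;

/* Advance the state machine on one protocol event; any unexpected
 * event (or an explicit error) fails the migration. */
import_state import_step(import_state s, import_event e) {
    if (e == EV_ERROR) return IMPORT_FAILED;
    switch (s) {
    case IMPORT_CONNECTING:   return e == EV_CONNECTED   ? IMPORT_SNAPSHOTTING : IMPORT_FAILED;
    case IMPORT_SNAPSHOTTING: return e == EV_ENDSNAPSHOT ? IMPORT_PAUSED       : IMPORT_FAILED;
    case IMPORT_PAUSED:       return e == EV_PAUSEOFFSET ? IMPORT_CATCHING_UP  : IMPORT_FAILED;
    case IMPORT_CATCHING_UP:  return e == EV_CAUGHT_UP   ? IMPORT_DONE         : IMPORT_FAILED;
    default:                  return IMPORT_FAILED;
    }
}
```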

If at any point a client is disconnected on either end, or a timeout is reached on the target node, the migration is marked as failed. If a migration fails, we delete all keys in the slots we were migrating.

Filtering traffic

We filter the traffic to the target node through a filtered AOF rewrite and a filtered replication stream. The filtered AOF rewrite requires some refactoring of the snapshot code for reuse, but utilizes the same overall procedure (piping through the parent process to the target node connection).

The filtered replication stream hooks into the existing replication code and appends the commands directly to the client output buffer. We don't use replication backlog as there is no easy way to filter it once added to the backlog without re-processing it to query for the slot number of each command.

We add a new check in putClientInPendingWriteQueue to prevent the two command streams from merging. The parent process will just accumulate the replication stream in the client output buffer until we get the notification that the target is done with the snapshot.
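The gating described here can be modeled as below. This is a hedged sketch under assumed names (client_sketch and its fields are illustrative, not the PR's actual structures); the point is only that the export client keeps accumulating output while the snapshot child runs, and never enters the pending-write queue until the snapshot is done:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the putClientInPendingWriteQueue check: while the
 * AOF-rewrite child is still streaming the snapshot, the slot export
 * client buffers replicated commands but stays out of the pending-write
 * queue, so the snapshot and the live stream never interleave. */
typedef struct {
    bool is_slot_export_client;
    bool snapshot_in_progress;  /* set while the filtered AOF child runs */
    bool in_pending_writes;
} client_sketch;

static bool may_install_write_handler(const client_sketch *c) {
    if (c->is_slot_export_client && c->snapshot_in_progress) return false;
    return true;
}

void put_client_in_pending_write_queue(client_sketch *c) {
    if (!may_install_write_handler(c)) return;  /* keep accumulating only */
    c->in_pending_writes = true;
}
```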

Other notes

  • Since many checks apply to both slot migration and replication, I added a new client flag, "replicated", which simply means that a client's stream is being replicated from another node. In many places, we now check the replicated flag instead of the primary flag.
  • To support non-contiguous slot ranges, the APIs and code use a slot bitmap to share information about what slots are being migrated.
  • We already delete unowned slots after loading RDB and AOF. Replicas that are promoted to primary now go through the same steps to delete all unowned keys. This prevents the keys from being leaked forever if the previous primary disappears.
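A 16384-slot bitmap fits in 2 KiB, so even non-contiguous SLOTSRANGE pairs can be shared as a single blob. A minimal sketch of such helpers (illustrative names, not the PR's actual API):

```c
#include <assert.h>
#include <string.h>

#define CLUSTER_SLOTS 16384  /* bitmap size: CLUSTER_SLOTS / 8 = 2048 bytes */

/* Mark a single slot in the bitmap. */
static void slot_bitmap_set(unsigned char *bitmap, int slot) {
    bitmap[slot >> 3] |= 1 << (slot & 7);
}

/* Test whether a slot is marked. */
int slot_bitmap_test(const unsigned char *bitmap, int slot) {
    return (bitmap[slot >> 3] >> (slot & 7)) & 1;
}

/* Add one inclusive SLOTSRANGE pair; call repeatedly for
 * non-contiguous ranges. */
void slot_bitmap_add_range(unsigned char *bitmap, int start, int end) {
    for (int s = start; s <= end && s < CLUSTER_SLOTS; s++)
        slot_bitmap_set(bitmap, s);
}
```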

Remaining work items

  • Continue cleaning up code, especially around the state machines
  • TCL tests
  • Metrics
  • Improved logging
  • Improved pause logic - where we only pause the slots being moved
  • Hide the in-progress import better for commands like KEYS and RANDOMKEY
  • Other management APIs (e.g. CLUSTER IMPORT STATUS, CLUSTER IMPORT CANCEL)
  • Incorporate dual-channel-style replication

Signed-off-by: Jacob Murphy <[email protected]>
codecov bot commented Jan 20, 2025

Codecov Report

Attention: Patch coverage is 27.00730% with 400 lines in your changes missing coverage. Please review.

Project coverage is 70.58%. Comparing base (2a1a65b) to head (f1d824f).
Report is 12 commits behind head on unstable.

Files with missing lines   Patch %   Missing lines
src/cluster_legacy.c       10.73%    374
src/kvstore.c              43.75%    9
src/rdb.c                  79.06%    9
src/aof.c                  50.00%    5
src/io_threads.c           0.00%     1
src/module.c               0.00%     1
src/networking.c           96.77%    1
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1591      +/-   ##
============================================
- Coverage     70.78%   70.58%   -0.21%     
============================================
  Files           120      121       +1     
  Lines         65046    65595     +549     
============================================
+ Hits          46045    46302     +257     
- Misses        19001    19293     +292     
Files with missing lines Coverage Δ
src/blocked.c 91.38% <100.00%> (ø)
src/cluster.c 89.21% <100.00%> (ø)
src/commands.def 100.00% <ø> (ø)
src/db.c 89.57% <100.00%> (+<0.01%) ⬆️
src/object.c 82.05% <100.00%> (+0.01%) ⬆️
src/replication.c 87.63% <100.00%> (+0.18%) ⬆️
src/server.c 87.63% <100.00%> (+0.02%) ⬆️
src/server.h 100.00% <ø> (ø)
src/io_threads.c 6.94% <0.00%> (ø)
src/module.c 9.59% <0.00%> (-0.01%) ⬇️
... and 5 more

... and 14 files with indirect coverage changes

hwware (Member) commented Jan 20, 2025:

Do you have plans to implement the following two cases?

  1. CLUSTER IMPORT CANCEL
  2. Source Primary failover during slot migration

murphyjacob4 (Contributor, Author) replied:

  1. CLUSTER IMPORT CANCEL

All this needs to do is call freeClient on the slot migration source client and delete all keys in the slot bitmap of that migration. The source tracks the slot migration, and when the client close notification comes in, it frees its local tracking information. This is the same process that already occurs when a migration times out.

  2. Source Primary failover during slot migration

The implementation plan would be:

  1. Where we currently check the timeout on the target side, also check that the source is still primary, and if not cancel the operation (as discussed above)
  2. On the source side, we would introduce a similar check: if we find out we have been demoted, cancel the operation and close the slot migration link. Right now the source will just give up if that happens; if we add retries in the future, subsequent attempts to run CLUSTER SYNCSLOTS will fail with an error because the node is no longer primary, which we would detect and use to fail the operation.
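The target-side check described in point 1 could be sketched as a periodic cron-style predicate. Everything here (import_job and its fields) is a hypothetical illustration of the plan, not code from the PR:

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical target-side periodic check: abort the import if it has
 * exceeded its timeout or if the source node is no longer primary
 * (learned via cluster gossip). On abort, the caller would free the
 * migration client and delete all keys in the migrating slots. */
typedef struct {
    time_t deadline;         /* wall-clock deadline for the migration */
    bool source_is_primary;  /* refreshed from cluster gossip */
} import_job;

bool import_should_abort(const import_job *job, time_t now) {
    if (now > job->deadline) return true;      /* timed out */
    if (!job->source_is_primary) return true;  /* source failed over */
    return false;
}
```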

Note that, due to the consensus-less epoch bump, if there is a race between a failover and a slot migration, both may succeed; one of them will later win deterministically via the epoch collision protocol, so some writes will be lost on the losing side during the window in which two nodes declared primaryship. This is the same issue that exists in the current consensus-less slot migration implementation, and we are looking to address it as part of #1355 instead.

murphyjacob4 (Contributor, Author) commented:

Hi folks, so @enjoy-binbin, @PingXie, and I discussed offline. Tencent has offered a solution that they have developed internally. Moving forward, @enjoy-binbin and I will join efforts and work on bridging gaps in the Tencent solution to meet the requirements we outlined in #23. Hopefully we will have a shared PR for review soon.

Given this, I am going to go ahead and close this pull request.
